Encoding method for the compression of a video sequence

ABSTRACT

The invention relates to an encoding method for the compression of a video sequence. Said method, using a three-dimensional wavelet transform, is based on a hierarchical subband coding process in which the subbands to be encoded are scanned in an order that preserves the initial subband structure of the 3D wavelet transform. According to the invention, a temporal (resp. spatial) scalability is obtained by performing a motion estimation at each temporal resolution level (resp. at the highest spatial resolution level), and only the part of the estimated motion vectors necessary to reconstruct any given temporal (resp. spatial) resolution level is then encoded and put in the bitstream together with the bits encoding the wavelet coefficients at this given temporal (resp. spatial) level, said insertion in the bitstream being done before encoding texture coefficients at the same temporal (resp. spatial) level. Such a solution avoids encoding and sending all the motion vector fields in the bitstream, which would be a drawback when a low bitrate is targeted and the receiver only wants a reduced frame rate or spatial resolution.

The present invention relates to an encoding method for the compression of a video sequence divided into groups of frames and decomposed by means of a three-dimensional (3D) wavelet transform leading to a given number of successive resolution levels that correspond to the decomposition levels of said transform, said method being based on a hierarchical subband encoding process leading from the original set of picture elements (pixels) of each group of frames to transform coefficients constituting a hierarchical pyramid, a spatio-temporal orientation tree (in which the roots are formed with the pixels of the approximation subband resulting from the 3D wavelet transform and the offspring of each of these pixels is formed with the pixels of the higher subbands corresponding to the image volume defined by these root pixels) defining the spatio-temporal relationship inside said hierarchical pyramid, the subbands to be encoded being scanned one after the other in an order that respects the parent-offspring dependencies formed in said tree and preserves the initial subband structure of the 3D wavelet transform.

Video streaming over heterogeneous networks requires a high degree of scalability. This means that parts of a bitstream can be decoded without a complete decoding of the sequence and can be combined to reconstruct the initial video information at lower spatial or temporal resolutions (spatial/temporal scalability) or with a lower quality (PSNR scalability). A convenient way to achieve all three types of scalability is a three-dimensional (3D) wavelet decomposition of the motion-compensated video sequence.

In a previous European patent application filed by the Applicant on May 3, 2000, with the number 00401216.7 (PHFR000044), a simple method of texture coding having this property has been described. In that method, as well as in other published documents (such as, for instance, “An embedded wavelet video coder using three-dimensional set partitioning in hierarchical trees (SPIHT)”, by B. Kim and W. A. Pearlman, Proceedings DCC'97, Data Compression Conference, Snowbird, Utah, U.S.A., 25-27 Mar. 1997, pp. 251-260), all the motion vector fields are encoded and sent in the bitstream, which may become a major drawback when a low bitrate is targeted and the receiver only wants a reduced frame rate or spatial resolution.

It is therefore an object of the invention to propose an encoding method better adapted to situations where a high scalability must be obtained.

To this end, the invention relates to an encoding method such as defined in the introductory part of the description and which is moreover characterized in that, in view of a temporal scalability, a motion estimation is performed at each temporal resolution level, the beginning of which is indicated by flags inserted into the bitstream, and only the estimated motion vectors necessary to reconstruct any given temporal resolution level are encoded and put in the bitstream together with the bits encoding the wavelet coefficients at this given temporal level, said motion vectors being inserted into said bitstream before encoding texture coefficients at the same temporal level.

In another embodiment, the invention also relates to an encoding method such as defined in said introductory part and which is characterized in that, in view of a spatial scalability, a motion estimation is performed at the highest spatial resolution level, the vectors thus obtained being divided by two in order to obtain the motion vectors for the lower spatial resolutions, and only the estimated motion vectors necessary to reconstruct any spatial resolution level are encoded and put in the bitstream together with the bits encoding the wavelet coefficients at this given spatial level, said motion vectors being inserted into said bitstream before encoding texture coefficients at the same spatial level, and said encoding operation being carried out on the motion vectors at the lowest spatial resolution, only refinement bits at each spatial resolution being then put in the bitstream bitplane by bitplane, from one resolution level to the other.

The technical solution thus proposed makes it possible to encode only the motion vectors corresponding to the desired frame rate or spatial resolution, instead of sending all the motion vectors corresponding to all possible frame rates and all spatial resolution levels.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will now be described, by way of example, with reference to the accompanying drawings in which:

FIG. 1 illustrates a temporal subband decomposition of the video information with motion compensation using the Haar multiresolution analysis;

FIG. 2 shows the spatio-temporal subbands resulting from a three-dimensional wavelet decomposition;

FIG. 3 illustrates the motion vector insertion in the bitstream for temporal scalability;

FIG. 4 shows the structure of the bitstream obtained with a temporally driven scanning of the spatio-temporal tree;

FIG. 5 is a binary representation of a motion vector and its progressive transmission from the lowest resolution to the highest;

FIG. 6 shows the bitstream organization for motion vector coding in the proposed scalable approach.

DETAILED DESCRIPTION OF THE INVENTION

A temporal subband decomposition of a video sequence is shown in FIG. 1. The illustrated 3D wavelet decomposition with motion compensation is applied to a group of frames (GOF), referenced F1 to F8. In this 3D subband decomposition scheme, each GOF of the input video is first motion-compensated (MC in FIG. 1), a step that makes it possible to process sequences with large motion, and then temporally filtered using Haar wavelets (the dotted arrows correspond to a high-pass temporal filtering, while the other ones correspond to a low-pass temporal filtering). After these two operations, each temporal subband is spatially decomposed into spatio-temporal subbands, which leads to a 3D wavelet representation of the original GOF, as illustrated in FIG. 2. In FIG. 1, three stages of decomposition are shown (L and H = first stage; LL and LH = second stage; LLL and LLH = third stage). At each temporal decomposition level of the illustrated group of 8 frames, a group of motion vector fields is generated (MV4 at the first level, MV3 at the second one, MV2 at the third one). When a Haar multiresolution analysis is used for the temporal decomposition, since one motion vector field is generated between every two frames in the considered group of frames at each temporal decomposition level, the number of motion vector fields is equal to half the number of frames in the temporal subband, i.e. four at the first level of motion vector fields, two at the second one, and one at the third one. At the decoder side, in order to reconstruct a given temporal level, only the motion vector fields at that level and at the lower temporal resolutions (reduced frame rate) are needed.
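By way of a non-limiting illustration, the count of motion vector fields given above may be sketched as follows in Python. The function names, and the convention that resolution level 1 designates the lowest frame rate (for which no motion vector field is needed), are assumptions made for this example only.

```python
def mv_fields_per_level(gof_size, decomposition_levels):
    """Number of motion vector fields generated at each temporal
    decomposition level (first level first), assuming Haar filtering
    with one field between every two frames."""
    counts = []
    frames = gof_size
    for _ in range(decomposition_levels):
        counts.append(frames // 2)  # one field per pair of frames
        frames //= 2                # the low-pass subband feeds the next level
    return counts


def mv_fields_needed(gof_size, decomposition_levels, resolution_level):
    """Fields a decoder needs for a given temporal resolution level.
    Level 1 (assumed numbering) is the lowest frame rate and needs none."""
    counts = mv_fields_per_level(gof_size, decomposition_levels)
    # The coarsest fields are used first, so read the list from its end.
    return sum(counts[::-1][:resolution_level - 1])
```

For a GOF of 8 frames and three decomposition levels, the first function yields 4, 2 and 1 fields, and a decoder reconstructing half of the full frame rate (resolution level 3 under the assumed numbering) would need 3 of the 7 fields.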

(A) Temporal Scalability

This observation leads, according to the invention, to organizing the bitstream in a way that allows a progressive decoding, as described for example in FIG. 3: three temporal decomposition levels TDL (as shown in FIG. 1) yield four temporal resolution levels (1 to 4), which represent the possible frame rates that can be obtained from the initial frame rate. The coefficients corresponding to the lowest temporal resolution level are encoded first, without sending motion vectors at this level, and, for all the other reconstruction frame rates, the motion vector fields and the frames of the corresponding high-frequency temporal subband are encoded. This description of the bitstream organization has so far only taken into account the temporal levels. However, for a complete scalability, one has to consider the spatial scalability inside each temporal level. The solution for wavelet coefficients was described in the European patent application already cited, and it is recalled in FIG. 4: inside each temporal scale, all the spatial resolutions are successively scanned (SDL = spatial decomposition levels), and therefore all the spatial frequencies are available (frame rates t=1 to 4; display sizes s=1 to 4). The upper flags separate two bitplanes, and the lower ones two temporal decomposition levels.
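Purely as an illustrative sketch, the temporal ordering described above may be expressed in Python as a list of labelled segments. The segment labels ("flag", "motion_vectors", "texture") and the function names are hypothetical, and the sketch deliberately ignores the spatial levels and bitplanes of FIG. 4.

```python
def temporal_layout(temporal_decomposition_levels):
    """Ordered (label, temporal_level) segments for temporal scalability.
    Level 1 is the lowest frame rate and carries no motion vectors."""
    segments = []
    resolution_levels = temporal_decomposition_levels + 1
    for t in range(1, resolution_levels + 1):
        segments.append(("flag", t))  # marks the beginning of level t
        if t > 1:
            # Motion vectors precede the texture of the same temporal level.
            segments.append(("motion_vectors", t))
        segments.append(("texture", t))
    return segments


def segments_for(layout, target_level):
    """Segments a decoder reads to reconstruct the target temporal level."""
    return [segment for segment in layout if segment[1] <= target_level]
```

With three temporal decomposition levels, a decoder interested only in the second temporal resolution level would, under these assumptions, read a single motion vector segment and skip those of levels 3 and 4.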

(B) Spatial Scalability

In order to be able to reconstruct a reduced spatial resolution video, it is not desirable to transmit the full-resolution motion vector fields at the beginning of the bitstream. Indeed, it is necessary to adapt the motion described by the motion vectors to the size of the current spatial level. Ideally, it would be desirable to have first a low-resolution motion vector field corresponding to the lowest spatial resolution and then to be able to progressively increase the resolution of the motion vectors as the spatial resolution increases. Only the difference from one motion vector field resolution to the next would be encoded and transmitted.

It will be assumed that the motion estimation is performed by means of a block-based method like full search block matching or any other derived solution, with an integer pixel precision on full resolution frames (this hypothesis does not reduce the generality of the problem: if one wants to work with half-pixel precision for motion vectors, multiplying all the motion vectors by 2 at the beginning brings one back to the previous case of integer vectors, even though they will represent fractional displacements). Thus, motion vectors are represented by integers. Given the full resolution motion vector field, in order to satisfy the above requirements of spatial scalability, the motion vector resolution is reduced by a simple divide-by-2 operation. Indeed, as the spatial resolution of the approximation subband is reduced by a factor of 2, while the motion is the same as in the full resolution subband, the displacements will be reduced by a factor of 2. This division is implemented for integers by a simple shift.
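The two operations just described (bringing half-pixel vectors back to integers, then halving by a shift) may be sketched as follows in Python. The function names are hypothetical. The treatment of negative components is also an assumption: Python's right shift rounds toward negative infinity, whereas the description does not specify how signs are handled.

```python
def to_integer_units(mv, half_pixel=False):
    """Express a motion vector (dx, dy) with integer components.
    Half-pixel vectors are multiplied by 2, as suggested in the text."""
    dx, dy = mv
    if half_pixel:
        return (int(round(2 * dx)), int(round(2 * dy)))
    return (int(dx), int(dy))


def halve(mv):
    """Divide both integer components by 2 using a one-position shift."""
    dx, dy = mv
    # Assumption: arithmetic shift, so -3 >> 1 gives -2 (floor), not -1.
    return (dx >> 1, dy >> 1)
```

For example, a half-pixel vector (1.5, -2.0) becomes the integer vector (3, -4), and halving (6, -3) gives (3, -2) under the rounding assumption stated above.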

The size of the blocks in the motion estimation must be chosen carefully: indeed, if the original size of the block is 8×8 in the full resolution, it will become 4×4 in the half resolution, then 2×2 in the quarter resolution, and so on. A problem will therefore appear if the original size of the blocks is too small: the size can become null at small spatial resolutions. Thus it must be checked that the original size is compatible with the number of decomposition/reconstruction levels.
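The compatibility check mentioned above may, for instance, be sketched as follows in Python. The function names are hypothetical, and the criterion used (the block side must remain at least one pixel at the lowest resolution) is one possible reading of the requirement, not a rule stated in the description.

```python
def block_sizes(full_resolution_side, levels):
    """Block side at each spatial resolution, from full resolution down
    to the lowest one, halving the side at every level."""
    return [full_resolution_side >> s for s in range(levels + 1)]


def is_compatible(full_resolution_side, levels):
    """True if the block does not vanish at the lowest spatial resolution."""
    return (full_resolution_side >> levels) >= 1
```

With an 8×8 block, three levels give sides 8, 4, 2 and 1, which is acceptable, whereas a fourth level would reduce the side to zero.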

It is now assumed that one has S spatial decomposition levels and that one wants the motion vectors corresponding to all possible resolutions, from the lowest to the highest. Then, either the initial motion vectors are divided by 2^(S) or a shift of S positions is performed. The result represents the motion vectors corresponding to the blocks of the lowest resolution, whose size is divided by 2^(S). A division by 2^(S−1) of the original motion vector would provide the next spatial resolution. But most of this value is already available from the previous operation: it corresponds to a shift of S−1 positions, and it differs from the result of the first operation only by the bit of weight 2^(S−1) in the binary representation of the motion vector. It is then sufficient to add this bit (the refinement bit) to the previously transmitted vector to reconstruct the motion vector at a higher resolution, which is illustrated in FIG. 5 for S=4. This progressive transmission of the motion vectors makes it possible to include in the bitstream the refinement bits of the motion vector fields from one spatial resolution to another just before the bits corresponding to the texture at the same spatial level. The proposed method is summarized in FIG. 6.
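The progressive transmission described above may be illustrated by the following minimal Python sketch, given for one integer component of a motion vector. The function names are hypothetical, the numerical value used below is arbitrary and is not taken from FIG. 5, and the use of an arithmetic shift for negative components is an assumption.

```python
def split_progressive(component, levels):
    """Split an integer motion vector component into the coarse value
    (shift of `levels` positions) and the refinement bits, ordered from
    weight 2^(levels-1) down to weight 2^0."""
    coarse = component >> levels
    bits = [(component >> k) & 1 for k in range(levels - 1, -1, -1)]
    return coarse, bits


def refine(coarse, bits):
    """Rebuild the component at each successive spatial resolution by
    appending one refinement bit per level to the previous value."""
    values = [coarse]
    for bit in bits:
        values.append((values[-1] << 1) | bit)
    return values
```

For a component equal to 45 (binary 101101) and S=4, the coarse value is 2 and the refinement bits are 1, 1, 0, 1, so that the decoder obtains successively 2, 5, 11, 22 and 45, each value being the original component shifted by one position fewer.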

The motion vectors at the lowest resolution are encoded with a DPCM technique followed by entropy coding using usual VLC tables (e.g., those used in MPEG-4). For the other resolution levels, a complete bitplane composed of the refinement bits of the motion vector field has to be encoded, for instance by means of a contextual arithmetic encoding, with the context depending on the horizontal or vertical component of the motion vector.
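The DPCM step may be sketched as follows in Python, for a sequence of integer values of one motion vector component. The choice of predictor (the previous value in scan order, with an initial prediction of 0) and the function names are assumptions made for this example; the subsequent entropy coding with VLC tables and the arithmetic coding of the refinement bitplanes are not shown.

```python
def dpcm_encode(values):
    """Replace each value by its difference from the previous one.
    Assumed predictor: previous value in scan order, starting from 0."""
    residuals = []
    previous = 0
    for value in values:
        residuals.append(value - previous)
        previous = value
    return residuals


def dpcm_decode(residuals):
    """Invert dpcm_encode by accumulating the residuals."""
    values = []
    previous = 0
    for residual in residuals:
        previous += residual
        values.append(previous)
    return values
```

When neighbouring blocks move similarly, the residuals tend to be small, which is what makes them suitable for variable-length coding.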

The part of the bitstream representing motion vectors precedes any information concerning the texture. The difference with respect to a “classical” non-scalable approach is that the hierarchy of temporal and spatial levels is transposed to the motion vector coding. The most significant improvement with respect to the previous technique is that the motion information can be decoded progressively. For a given spatial resolution, the decoder does not have to decode parts of the bitstream that are not useful at that level.

1. An encoding method for the compression of a video sequence divided into groups of frames and decomposed by means of a three-dimensional (3D) wavelet transform leading to a given number of successive resolution levels that correspond to the decomposition levels of said transform, said method being based on a hierarchical subband encoding process leading from the original set of picture elements (pixels) of each group of frames to transform coefficients constituting a hierarchical pyramid, a spatio-temporal orientation tree (in which the roots are formed with the pixels of the approximation subband resulting from the 3D wavelet transform and the offspring of each of these pixels is formed with the pixels of the higher subbands corresponding to the image volume defined by these root pixels) defining the spatio-temporal relationship inside said hierarchical pyramid, the subbands to be encoded being scanned one after the other in an order that respects the parent-offspring dependencies formed in said tree and preserves the initial subband structure of the 3D wavelet transform, said method being further characterized in that, in view of a temporal scalability, a motion estimation is performed at each temporal resolution level, the beginning of which is indicated by flags inserted into the bitstream, and only the estimated motion vectors necessary to reconstruct any given temporal resolution level are encoded and put in the bitstream together with the bits encoding the wavelet coefficients at this given temporal level, said motion vectors being inserted into said bitstream before encoding texture coefficients at the same temporal level.
2. An encoding method for the compression of a video sequence divided into groups of frames and decomposed by means of a three-dimensional (3D) wavelet transform leading to a given number of successive resolution levels that correspond to the decomposition levels of said transform, said method being based on a hierarchical subband encoding process leading from the original set of picture elements (pixels) of each group of frames to transform coefficients constituting a hierarchical pyramid, a spatio-temporal orientation tree (in which the roots are formed with the pixels of the approximation subband resulting from the 3D wavelet transform and the offspring of each of these pixels is formed with the pixels of the higher subbands corresponding to the image volume defined by these root pixels) defining the spatio-temporal relationship inside said hierarchical pyramid, the subbands to be encoded being scanned one after the other in an order that respects the parent-offspring dependencies formed in said tree and preserves the initial subband structure of the 3D wavelet transform, said method being further characterized in that, in view of a spatial scalability, a motion estimation is performed at the highest spatial resolution level, the vectors thus obtained being divided by two in order to obtain the motion vectors for the lower spatial resolutions, and only the estimated motion vectors necessary to reconstruct any spatial resolution level are encoded and put in the bitstream together with the bits encoding the wavelet coefficients at this given spatial level, said motion vectors being inserted into said bitstream before encoding texture coefficients at the same spatial level, and said encoding operation being carried out on the motion vectors at the lowest spatial resolution, only refinement bits at each spatial resolution being then put in the bitstream bitplane by bitplane, from one resolution level to the other.